-
-
Notifications
You must be signed in to change notification settings - Fork 101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Insert size filtering #214
Conversation
@@ -870,7 +878,7 @@ Other options: | |||
stderr.write_line("[mosdepth] error alignment file must be indexed") | |||
quit(2) | |||
|
|||
var opts = SamField.SAM_FLAG.int or SamField.SAM_RNAME.int or SamField.SAM_POS.int or SamField.SAM_MAPQ.int or SamField.SAM_CIGAR.int | |||
var opts = SamField.SAM_FLAG.int or SamField.SAM_RNAME.int or SamField.SAM_POS.int or SamField.SAM_MAPQ.int or SamField.SAM_CIGAR.int or SamField.SAM_TLEN.int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed this was necessary but I haven't actually tested if it's the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And if it is correct, we can remove the comment in line 883
Otherwise, this PR might cause reads with insert size 0 to be excluded, which might be okay but is not the intended change I want to make
Since I use paired-end sequencing, insert size is the right way to get the fragment length. But for single-read sequencing it might not be? What's the best approach to handling this? |
Hi @LudvigOlsen , nice job figuring this out. Indeed this looks like you've handled it cleanly and correctly.
as to the PR, and the general reason I'm hesitant to add features. This feature makes sense, and now you've implemented it, but now I need to maintain, test, and document this. And it adds extra load to the user to have 2 additional parameters that will very rarely be used. That load translates into additional questions. Let me think about what to do here. |
Hi @brentp Good to hear! I think I've seen template lengths of 0 before, so I was worried you might get an insert size of 0 in single-read data. I understand it's more complex than just the implementation I've suggested. And I really appreciate your work on My colleagues and I, across multiple research groups, tend to need this functionality. And the current approach is to subset large bam files for every experiment, which is a bit of a hassle. If I can help with the testing and documentation of these features, do let me know. I'm very new to nim, but do have extensive experience with unit/regression testing. |
@LudvigOlsen, it could be a very useful feature. Thank you! |
@LudvigOlsen would you write a test for this? you can add to |
Hi @brentp Sure! I added the error and updated the argument names. If you want fragments with a specific fragment length, I guess min and max should be allowed to be equal? I will look at the testing soon |
Ah, yes, that's right. For testing, you can just use one of the current test bams and manually find, for example an exact fragment length, then add that as the test. |
thank you! |
@brentp I have added some tests. First time with shell-based tests but I think, I figured it out. Perhaps try running them on your side as well though. Any more tests you want added? :-) |
Thanks very much @LudvigOlsen! I'll get out a new release next week. |
Since I needed the insert size filtering (#213, #122) for my current analysis, I figured out the
nim
thing and made it work! :-)It adds two arguments to the CLI:
--min_len
and--max_len
. I chose the lettersl
for lower andu
for upper.I tested it on a file and it seemed to work. I asked for fragments of size 110-115 and looked in the
*per-base.bed
file. All the intervals with a 1 count had a length between 110-115 (majority) or smaller which I guess is in case of overlapping fragments.I used:
cat between_110_115.per-base.bed | awk '$4 == 1 {print $1,$2,$3,$3-$2,$4}' | head -n500
and got:
(Second last column is the interval length)
I've run with each of them separately, which also works.